vtgate: use fixed retry interval for health check reconnection#19967
vtgate: use fixed retry interval for health check reconnection#19967khkim6040 wants to merge 9 commits into
Conversation
Replace exponential backoff in checkConn with a fixed retry interval (default 5s). Previously, the retry delay doubled on each failure up to healthCheckTimeout (default 60s), causing vtgates to take up to 60s to rediscover tablets after prolonged outages. Fixes vitessio#19894 Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Add TestHealthCheckRetryDelayIsFixed to verify that repeated stream failures do not cause the retry delay to grow. The test sends multiple consecutive errors and asserts that the tablet is still rediscovered within a short, fixed window once it recovers. Also update a stale comment referencing exponential backoff in TestHealthCheckTimeout. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
There was a problem hiding this comment.
Pull request overview
This PR changes VTGate’s tablet healthcheck reconnection behavior to use a fixed retry interval (instead of exponential backoff) so recovered tablets are rediscovered promptly and consistently across vtgate instances.
Changes:
- Removed exponential backoff logic in
tabletHealthCheck.checkConn, keeping retries at a fixedhc.retryDelay. - Added a regression test intended to verify the retry delay remains fixed across repeated stream failures.
- Updated a stale test comment referencing exponential backoff.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| go/vt/discovery/tablet_health_check.go | Removes exponential backoff in the StreamHealth reconnect loop in favor of a fixed retry interval. |
| go/vt/discovery/healthcheck_test.go | Adds a new regression test for fixed retry delay behavior and updates an existing test comment. |
| Serving: true, | ||
| PrimaryTermStartTimestamp: 0, | ||
| RealtimeStats: &querypb.RealtimeStats{ReplicationLagSeconds: 1, CpuUsage: 0.2}, | ||
| } | ||
| input <- shr |
There was a problem hiding this comment.
input <- shr can block until fakeConn.StreamHealth is running again (because input/hcChan is unbuffered). That means the test’s 100ms timeout starts after the reconnect happens, so it won’t actually catch an exponential-backoff regression. Make the health stream channel buffered (e.g. size 1) or send shr asynchronously and start the timer before/while sending so the assertion includes the retry sleep time.
| // Streaming RPC failed e.g. because vttablet was restarted or took too long. | ||
| // Sleep until the next retry is up or the context is done/canceled. | ||
| // We use a fixed retry interval instead of exponential backoff so that | ||
| // vtgate rediscovers recovered tablets promptly. See #19894. | ||
| select { | ||
| case <-thc.ctx.Done(): | ||
| return | ||
| case <-time.After(retryDelay): | ||
| // Exponentially back-off to prevent tight-loop. | ||
| retryDelay *= 2 | ||
| // Limit the retry delay backoff to the health check timeout | ||
| if retryDelay > hc.healthCheckTimeout { | ||
| retryDelay = hc.healthCheckTimeout | ||
| } | ||
| } |
There was a problem hiding this comment.
With the switch to a fixed retry interval, retryDelay now directly controls how aggressively vtgate reconnects after stream failures. However, vtgate’s --healthcheck-retry-delay default is currently 2ms (see go/vt/vtgate/vtgate.go and generated flags docs), which would cause very tight reconnect loops and time.After allocations during outages. Consider updating the vtgate default to a saner value (e.g. discovery’s 5s default) and/or enforcing a reasonable minimum retryDelay here to avoid runaway retries.
|
Promptless prepared a documentation update related to this change. Triggered by this PR Added documentation to the vtgate configuration guide explaining that vtgate now uses a fixed retry interval (controlled by |
|
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed. |
Replace the fixed retry interval in checkConn with the same base delay plus +/-25% jitter, and remove the now-dead local retryDelay variable and its no-op reset. A purely fixed interval re-synchronizes vtgate instances that started retrying together during a shared outage, so they would all reconnect to a recovered tablet in lockstep. Jittering the interval spreads those reconnects out while still rediscovering recovered tablets promptly. See vitessio#19894. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Rename TestHealthCheckRetryDelayIsFixed to TestHealthCheckRetryDelayIsBounded since the interval now carries jitter rather than being strictly fixed, and update its comments accordingly. Replace t.Fatal with require.Fail per the project test conventions, and widen the timeout to 30s: the success path returns almost immediately, so a healthy run never approaches it, while a regression that re-grows the delay still trips it without CI flakiness. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
|
Reworked this from a pure fixed interval to a fixed base (5s) with a +- 25% jitter. A perfectly fixed interval re-synchronizes the vtgates that started retrying together during a shared outage, so they'd all reconnect to a recovered tablet in lockstep. The jitter keeps rediscovery fast (#19894) while spreading those reconnects out, so no thundering herd. The dead @deepthi when you have a moment — I went with fixed+jitter here, but if you'd prefer keeping the backoff with a lower cap instead, happy to switch. Does this direction look right to you? |
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #19967 +/- ##
===========================================
+ Coverage 69.67% 84.42% +14.74%
===========================================
Files 1614 319 -1295
Lines 216793 57403 -159390
===========================================
- Hits 151044 48460 -102584
+ Misses 65749 8943 -56806
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
mattlord
left a comment
There was a problem hiding this comment.
In go/vt/vtgate/vtgate.go:88, go/vt/discovery/tablet_health_check.go:326 the PR makes --healthcheck-retry-delay the steady-state retry interval, but vtgate’s default is still 2ms, not the 5s claimed in the PR/issue. With jitter, a default vtgate will now retry every ~1.5-2.5ms per down tablet for the full outage, creating a very hot reconnect loop and time.After allocation churn across the fleet. Please update the vtgate default to discovery.DefaultHealthCheckRetryDelay/5s and regenerate flag docs, or enforce a sane minimum before using the delay as a fixed interval.
In go/vt/discovery/healthcheck_test.go:386-396 the regression test still does not measure the retry sleep after recovery. input <- shr uses an unbuffered channel, so it blocks until the healthcheck goroutine has already reconnected; the timeout starts only after the old exponential backoff would have elapsed. This means the test would pass even if exponential backoff were restored. Make the channel buffered or send the healthy response asynchronously and start timing before the send can unblock.
Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
| func retryInterval(base time.Duration) time.Duration { | ||
| if base <= 0 { | ||
| return base | ||
| } | ||
| // Spread uniformly within [base-base/4, base+base/4). | ||
| return base - base/4 + time.Duration(rand.Int64N(int64(base/2))) | ||
| } |
| const base = 5 * time.Second | ||
| lower := time.Duration(float64(base) * 0.75) | ||
| upper := time.Duration(float64(base) * 1.25) |
| start := time.Now() | ||
| input <- shr | ||
|
|
||
| // The fixed retry delay (~10ms +/- jitter) rediscovers the tablet within a | ||
| // single cycle. The old exponential backoff would sleep ~320ms after | ||
| // numErrors failures (10ms * 2^5), so a regression that regrows the delay | ||
| // blows past this bound. The 5s arm only trips on a true hang. | ||
| select { | ||
| case result := <-resultChan: | ||
| assert.True(t, result.Serving, "tablet should be serving after recovery") | ||
| assert.Less(t, time.Since(start), 100*time.Millisecond, | ||
| "rediscovery took too long after recovery; retry delay may be growing exponentially") |
| // healthCheckRetryDelay is the fixed interval between healthcheck reconnection | ||
| // attempts. Since it is now a steady-state interval rather than a backoff seed, | ||
| // it defaults to discovery's 5s instead of a few milliseconds. | ||
| healthCheckRetryDelay = discovery.DefaultHealthCheckRetryDelay |
Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
The fixed 5s health check retry interval can leave vtgate without a reconnected stream when the DML runs, so accept "is either down or nonexistent" alongside "wrong tablet type" as the VT15001 reason. Signed-off-by: Gwanho Kim <khkim6040@gmail.com>
| select { | ||
| case <-thc.ctx.Done(): | ||
| return | ||
| case <-time.After(retryDelay): | ||
| // Exponentially back-off to prevent tight-loop. | ||
| retryDelay *= 2 | ||
| // Limit the retry delay backoff to the health check timeout | ||
| if retryDelay > hc.healthCheckTimeout { | ||
| retryDelay = hc.healthCheckTimeout | ||
| } | ||
| case <-time.After(retryInterval(hc.retryDelay)): | ||
| } |
| func retryInterval(base time.Duration) time.Duration { | ||
| // half is the jitter span passed to rand.Int64N, which panics on n <= 0. For | ||
| // a non-positive base, or one so small that base/2 truncates to zero (sub-2ns), | ||
| // there is no room to jitter, so return the base unchanged. | ||
| half := base / 2 | ||
| if half <= 0 { | ||
| return base | ||
| } | ||
| // Spread uniformly within [base-base/4, base+base/4). | ||
| return base - base/4 + time.Duration(rand.Int64N(int64(half))) | ||
| } |
| fc := createFakeConn(tablet, input) | ||
| fc.errCh = make(chan error) | ||
| hc.AddTablet(tablet) | ||
|
|
||
| // Drain the initial not-serving notification from AddTablet. | ||
| <-resultChan | ||
|
|
||
| // Send multiple consecutive stream errors to simulate a prolonged outage | ||
| // where the tablet is unreachable for many retry cycles. | ||
| const numErrors = 8 // with exponential backoff the post-recovery sleep would reach ~1.28s (10ms*2^7) | ||
| for range numErrors { | ||
| fc.errCh <- errors.New("connection refused") | ||
| <-resultChan // drain the not-serving update |
…-backoff-19894 Signed-off-by: Gwanho Kim <khkim6040@gmail.com> # Conflicts: # go/test/endtoend/reparent/newfeaturetest/reparent_with_open_tx_test.go
Description
The
checkConnreconnect loop used to doubleretryDelayon every streamfailure (capping at
healthCheckTimeout, default 60s). After a prolongedtablet outage each vtgate sits at a different point on that backoff curve, so a
recovered tablet is rediscovered anywhere from a few seconds to ~60s later
depending on the vtgate — the stagger described in #19894.
This PR drops the exponential backoff and retries on a fixed base interval
(
--healthcheck-retry-delay, default 5s) with +/-25% jitter. The fixed basekeeps rediscovery prompt; the jitter stops every vtgate from reconnecting to a
recovered tablet in lockstep (a thundering-herd we'd otherwise create by making
the interval perfectly fixed).
Related Issue(s)
Fixes #19894
Checklist
Deployment Notes
--healthcheck-retry-delay(default 5s) now drives a fixed retry interval with+/-25% jitter instead of the initial value of an exponential backoff. Operators
who relied on the backoff growing to throttle retries during long outages may
want to raise it; 5s is fine for most deployments.
AI Disclosure
Written with Claude Code — I gave the direction and chose the design (fixed
interval + jitter over a pure fixed interval); Claude wrote the code and tests.